Introduction

In this notebook, we'll take the Natural Language Toolkit (NLTK) for a spin and apply it to James Joyce's Ulysses to see how the novel is assembled - in particular, Chapter 8, which Joyce called "Lestrygonians." We'll start from the lowest level - the individual letters - and work our way upwards, to n-grams, then words, then phrases, sentences, and paragraphs. To do this, we'll be using NLTK's built-in functions to tokenize, parse, clean, and process text.

We'll start by importing some libraries.


In [135]:
# In case we want to plot something:
%matplotlib inline 

from __future__ import division
import nltk, re

# The io module makes unicode easier to deal with
import io

We'll use NLTK's built-in word tokenizer to tokenize the text. There are many tokenizers available, both at the word and sentence level, but the word_tokenize() function is the no-hassle option.
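
As an aside, sentence-level tokenization works much the same way. Here's a minimal sketch (the variable names are just for illustration, and it assumes the same chapter file used in the next cell):

# A quick sketch of sentence-level tokenization with nltk.sent_tokenize(),
# reading the same chapter file as below.
raw = io.open('txt/08lestrygonians.txt','r').read()
sentences = nltk.sent_tokenize(raw)
print len(sentences)     # how many sentences the tokenizer found
print sentences[0]       # the first sentence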


In [136]:
######################################
# Words, words, words.
# Start by splitting the text into word tokens.

# Create a word tokenizer object.
# This uses io.open() because that deals with unicode more gracefully.
tokens = nltk.word_tokenize(io.open('txt/08lestrygonians.txt','r').read())

The variable tokens is now a list containing all the word-level tokens that resulted from word_tokenize():


In [137]:
print type(tokens)
print len(tokens)


<type 'list'>
15153

In [138]:
print tokens[:21]


[u'Pineapple', u'rock', u',', u'lemon', u'platt', u',', u'butter', u'scotch', u'.', u'A', u'sugarsticky', u'girl', u'shovelling', u'scoopfuls', u'of', u'creams', u'for', u'a', u'christian', u'brother', u'.']

Exploring the Text

Now that we have all the words from this chapter in a list, we can create a new object of class Text. This is a wrapper around a sequence of tokens, and is designed to explore the text by counting, providing a concordance, etc. We'll create a Text object from the list of tokens that resulted from word_tokenize(). The Text object also has some useful functions like findall(), to search for particular words or phrases using regular expressions.


In [139]:
def p():
    print "-"*20

# Start by creating an NLTK text object
text = nltk.Text(tokens)

p()
text.findall(r'<with> <.*> <and> <.*>')
p()
text.findall(r'<as> <.*> <as> <.*> <.*>')
p()
text.findall(r'<no> <\w+> <\w+>')
p()
text.findall(r'<\w+> <with> <the> <\w+>')
p()
text.findall(r'<see> <\w+> <\w+>')


--------------------
with meat and drink; with gold and still; with chummies and
streetwalkers; with porringers and tommycans; with lemon and rice;
with such and such
--------------------
as big as a collie; as witty as calling him; as big as the Phoenix; as
close as damn it
--------------------
no more about; no go in; no teeth to; no straight sport; no June has;
no ar no; no yes or
--------------------
out with the things; communicate with the outside; dress with the
braided; was with the red; supper with the Chutney; it with the hot;
Clerk with the glasses; meet with the approval; out with the Ward; all
with the job; walk with the band; weather with the chill; Riordan with
the rumbling; outs with the watch
--------------------
see him on; see the bluey; see them do; see the brewery; see it now;
see her in; see anything of; see produces the; see them library; see
if she; see a gentleman; see him look; see what he; see you across;
see the lines; see me perhaps

Concordance

The Text object also provides a concordance. When a word is passed to the concordance, it prints a line of context around each occurrence of the word:


In [140]:
p()
text.concordance('eye',width=65)
p()
text.concordance('Molly',width=65)
p()
text.concordance('eat',width=65)


--------------------
Displaying 7 of 7 matches:
ls writing something catch the eye at once . Everyone dying to kn
 things . Stick it in a chap’s eye in the tram . Rummaging . Open
 so older than Molly . See the eye that woman gave her , passing 
eher he has Harvey Duff in his eye . Like that Peter or Denis or 
him . Freeze them up with that eye of his . That’s the fascinatio
bling to the left . Mr Bloom’s eye followed its line and saw agai
. Kind of a form in his mind’s eye . The voice , temperatures : w
--------------------
Displaying 10 of 10 matches:
ering themselves in and out . Molly tasting it , her veil up . Si
us . Milly was a kiddy then . Molly had that elephantgrey dress w
 —No use complaining . How is Molly those times ? Haven’t seen he
 Only a year or so older than Molly . See the eye that woman gave
im . Goodbye . Remember me to Molly , won’t you ? —I will , Mr Bl
 . Kill me that would . Lucky Molly got over hers lightly . They 
ogether , their bellies out . Molly and Mrs Moisel . Mothers’ mee
s gives a woman clumsy feet . Molly looks out of plumb . He passe
rier in the City Arms hotel . Molly fondling him in her lap . O ,
 of those silk petticoats for Molly , colour of her new garters .
--------------------
Displaying 13 of 13 matches:
d you ever hear such an idea ? Eat you out of house and home . No
tnutmeal it tastes like that . Eat pig like pig . But then why is
weggebobbles and fruit . Don’t eat a beefsteak . If you do the ey
ke street . Here we are . Must eat . The Burton . Feel better the
nt . His gorge rose . Couldn’t eat a morsel here . Fellow sharpen
w sharpening knife and fork to eat all before him , old chap pick
the fidgets to look . Safer to eat from his three hands . Tear it
 back towards Grafton street . Eat or be eaten . Kill ! Kill ! Su
ese . Slaughter of innocents . Eat drink and be merry . Then casu
s out of the ground the French eat , out of the sea with bait on 
sburgs ? Or who was it used to eat the scruff off his own head ? 
gs of the flesh . Know me come eat with me . Royal sturgeon high 
wants job . Small wages . Will eat anything . Mr Bloom turned at 

Joyce's Ulysses is filled with colors. If we have a list of colors, we can pass them to the concordance one at a time, like so:


In [141]:
colors = ['blue','purple','red','green','white','yellow']#'indigo','violet']
for c in colors:
    p()
    text.concordance(c,width=65)


--------------------
Displaying 4 of 4 matches:
 and snapped the catch . Same blue serge dress she had two years
Breen in skimpy frockcoat and blue canvas shoes shuffled out of 
 , with wadding in her ears . Blue jacket and yellow cap . Bad l
eating eggs fifty years old , blue and green again . Dinner of t
--------------------
Displaying 1 of 1 matches:
No sound . The sky . The bay purple by the Lion’s head . Green b
--------------------
Displaying 7 of 7 matches:
 Sitting on his throne sucking red jujubes white . A sombre Y.M.C
 Hy Franks . Didn’t cost him a red like Maginni the dancing maste
 little room that was with the red wallpaper . Dockrell’s , one a
time flies , eh ? Showing long red pantaloons under his skirts . 
th . More power , Pat . Coarse red : fun for drunkards : guffaw a
sewage they feed on . Fizz and Red bank oysters . Effect on the s
ual . Aphrodis . He was in the Red Bank this morning . Was he oys
--------------------
Displaying 6 of 6 matches:
rg up in the trees near Goose green playing the monkeys . Mackere
mustard , the feety savour of green cheese . Sips of his wine soo
gs fifty years old , blue and green again . Dinner of thirty cour
iare . Do the grand . Hock in green glasses . Swell blowout . Lad
y purple by the Lion’s head . Green by Drumleck . Yellowgreen tow
 his teeth smooth . Something green it would have to be : spinach
--------------------
Displaying 9 of 9 matches:
is throne sucking red jujubes white . A sombre Y.M.C.A . young ma
et letters on their five tall white hats : H. E. L. Y. S. Wisdom 
mping the busk of her stays : white . Swish and soft flop her sta
faw and smoke . Take off that white hat . His parboiled eyes . Wh
ck feet that woman has in the white stockings . Hope the rain muc
s would with lemon and rice . White missionary too salty . Like p
 said . —Nothing in black and white , Nosey Flynn said . Paddy Le
black . Then passing over her white skin . Different feel perhaps
ent feel perhaps . Feeling of white . Postoffice . Must answer . 
--------------------
Displaying 4 of 4 matches:
kard . Well up : it splashed yellow near his boot . A diner , kn
dded under each lifted strip yellow blobs . Their lives . I have
n her ears . Blue jacket and yellow cap . Bad luck to big Ben Do
lly . But I know it’s whitey yellow . Want to try in the dark to

Word Counts

Chapter 8 of Ulysses, Lestrygonians, is named for the episode in Homer's Odyssey in which Odysseus and his crew encounter the island of the cannibal Lestrygonians. The language in the chapter very much reflects that. Here we use the count method of the Text class to count some meaty, sensory, organic words.


In [142]:
# Count a few words
def count_it(word):
    print '"'+word+'" : ' + str(text.count(word))

words = ['eyes','eye','mouth','God','food','eat','knife','blood','son','teeth','skin','meat','flies','guts']
for w in words:
    count_it(w)


"eyes" : 30
"eye" : 7
"mouth" : 14
"God" : 10
"food" : 10
"eat" : 9
"knife" : 7
"blood" : 7
"son" : 3
"teeth" : 2
"skin" : 3
"meat" : 6
"flies" : 3
"guts" : 2

Let's return to the results of word_tokenize(), which were stored in the tokens variable. This list contains roughly 15,000 tokens. We can use it as an all-inclusive wordlist, and filter out words based on certain criteria to get new wordlists. We can also use built-in string methods to process words. Here are two examples: one that converts all tokens to lowercase using a built-in string method, and one that extracts words with the suffix "-ed" using a regular expression.


In [143]:
# The tokens variable is a list containing 
# a full wordlist, plus punctuation.
# 
# We can make word lists by filtering out words based on criteria.
# Can use regular expressions or built-in string methods.

# For example, a list with every token converted to 
# lowercase, using built-in string methods
# (an alternative would be to keep only tokens that are
# already lowercase, as in the commented-out line):
#lowerlist = [w for w in tokens if w.islower()]
lowerlist = [w.lower() for w in tokens]

# and use this to find all "-ed" suffixes
# using a regular expression:
verbed = [w2 for w2 in lowerlist if re.search('ed$',w2) and len(w2)>4]

This also allows us to do things like compile lists of words that are either colors themselves, or that have color words in them. To find unique words, we use a set object, a built-in Python type, and add words that contain color words ('blue', 'green', and so on). The color red tends to pull in lots of non-color words, so it is excluded.


In [144]:
colors = ['orange','yellow','green','blue','indigo','rose','violet']#,'red']
mecolors = set()
_ = [mecolors.add(w.lower()) for c in colors for w in tokens if re.search(c,w.lower()) ]
print mecolors


set([u'blue', u'greenhouses', u'greens', u'penrose', u'orangepeels', u'bluecoat', u'orangegroves', u'greeny', u'yellow', u'bluey', u'yellowgreen', u'green', u'rose', u'blues', u'greenwich'])

Character Frequencies

To count character frequencies, we'll use a character iterator within a word iterator, and turn it loose on the entire chapter. The list containing each word converted to lowercase, lowerlist, will come in handy for this purpose.


In [145]:
# Make a dictionary that contains total count of each character
charcount = {}
tot = 0
for word in lowerlist:
    for ch in word:
        if ch in charcount.keys():
            charcount[ch] += 1
            tot += 1
        elif (re.match('^[A-Za-z]{1,}$', ch) is not None):
            charcount[ch] = 1
            tot += 1

# Make a dictionary that contains frequency of each character
charfrequencies = {}
keys = charcount.keys()
keys.sort()
for k in keys:
    f = charcount[k]/(1.0*tot)
    charfrequencies[k] = f*100

print "%s : %s : %s"%('char.','occurr.','freq.')
for k in charcount.keys():
    print "%s : %04d times : %0.2f %%"%(k,charcount[k],charfrequencies[k])


char. : occurr. : freq.
a : 4063 times : 7.39 %
c : 1230 times : 2.24 %
b : 0995 times : 1.81 %
e : 6448 times : 11.73 %
d : 2215 times : 4.03 %
g : 1438 times : 2.62 %
f : 1296 times : 2.36 %
i : 3700 times : 6.73 %
h : 3355 times : 6.10 %
k : 0701 times : 1.28 %
j : 0090 times : 0.16 %
m : 1487 times : 2.71 %
l : 2561 times : 4.66 %
o : 4347 times : 7.91 %
n : 3616 times : 6.58 %
q : 0056 times : 0.10 %
p : 1079 times : 1.96 %
s : 3799 times : 6.91 %
r : 3160 times : 5.75 %
u : 1592 times : 2.90 %
t : 4737 times : 8.62 %
w : 1260 times : 2.29 %
v : 0410 times : 0.75 %
y : 1234 times : 2.25 %
x : 0052 times : 0.09 %
z : 0045 times : 0.08 %

Comparing Character Frequencies to English

Using values of character frequencies for a broader sample of the language spectrum, we can determine how closely the use of the alphabet in Lestrygonians matches the usage of the alphabet in common language, and whether there's a there there. We'll start by importing a dictionary that contains a key/value set of frequencies for each letter.
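
The English module isn't listed in this notebook, but presumably it just defines a dictionary mapping each lowercase letter to its frequency (as a percentage) in typical English text. A sketch of what such a module might contain - the values shown are standard published English letter frequencies, not necessarily the exact ones used here:

# English.py - assumed structure (illustrative values only)
EnglishLanguageLetterFrequency = {
    'e': 12.70, 't': 9.06, 'a': 8.17, 'o': 7.51, 'i': 6.97,
    # ... and so on for the remaining letters ...
    'z': 0.07,
}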

The following bit of code makes sure that we're only dealing with the alphabet, and that the keys match between the Lestrygonians character frequency dictionary and the English language character frequency dictionary that we just imported.


In [146]:
from English import EnglishLanguageLetterFrequency
ufk = set([str(k) for k in charfrequencies.keys()])
efk = set([str(k) for k in EnglishLanguageLetterFrequency.keys()])
common_keys = ufk.intersection(efk)

The next step is to compute how far each character's frequency deviates from "normal" English usage. We use the formula for percent difference: if the frequency of a given character in the Lestrygonians chapter is denoted $y_{L}$ and the frequency of that character in the English language as a whole is denoted $y_{E}$, then the percent difference is given by:

$$ \text{Pct Diff} = \dfrac{y_{L} - y_{E}}{y_{E}} \times 100 $$

In [147]:
frequency_variation = {}
for k in common_keys:
    clf = charfrequencies[k]
    elf = EnglishLanguageLetterFrequency[k]
    pd = ((clf-elf)/elf)*100
    frequency_variation[k] = pd

In [148]:
for k in frequency_variation.keys():
    print "%s : %0.2f %%"%(k,frequency_variation[k])


a : -8.97 %
c : -17.43 %
b : 21.49 %
e : -2.41 %
d : -6.72 %
g : 28.88 %
f : 2.51 %
i : -7.91 %
h : 3.10 %
k : 84.83 %
j : 63.74 %
m : 3.65 %
l : 17.07 %
o : 2.98 %
n : -5.34 %
q : -7.38 %
p : 7.86 %
s : 10.06 %
r : -4.50 %
u : 0.57 %
t : -5.30 %
w : 9.68 %
v : -32.80 %
y : 6.40 %
x : -44.35 %
z : 16.96 %

We see (and would expect) that there is more variation among the less common letters, since their smaller counts make the percent difference more sensitive to small changes in usage.

However, there are some meaningful trends - if we pick out the letters that have a large positive percent difference, meaning they occur more in this chapter than they usually do in English, we find more harsh sounds: there are more b's, g's, j's and k's in this chapter than would be expected. These sounds are more guttural, cutting, blubbering letters, and fit with the character of Lestrygonians. Joyce's word selection in this chapter comes through even at the character frequency level.

Bigrams

The next thing we can do is use regular expressions to analyze the bigrams that appear in the chapter. Regular expressions allow us to seek out pairs of letters matching a particular pattern. Combined with list comprehensions, we can do some very powerful analysis with just a little bit of code.

For example, this one-liner iterates over each word token, grabs bigrams matching a particular regular expression, and uses the result to initialize a frequency distribution object, which enables quick and easy access to useful statistical information about the bigrams:


In [149]:
vowel_bigrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou]{2}',word))

The regular expression syntax '[aeiou]' will match any vowels, and the '{2}' syntax means, match exactly 2 occurrences of the pattern (two vowels in a row).
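
Note that re.findall() returns non-overlapping matches, scanning left to right. A quick sanity check on a couple of made-up words (not from the text):

# re.findall() returns non-overlapping matches, left to right:
print re.findall(r'[aeiou]{2}', 'queue')      # ['ue', 'ue']
print re.findall(r'[aeiou]{2}', 'beautiful')  # ['ea'] - after 'ea' matches, the scan resumes at 'u', and 'ut' is not a match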


In [150]:
sorted(vowel_bigrams.items(), key=lambda tup: tup[1], reverse=True)


Out[150]:
[(u'ou', 554),
 (u'ea', 366),
 (u'oo', 304),
 (u'ee', 265),
 (u'ai', 229),
 (u'ie', 91),
 (u'io', 75),
 (u'ei', 65),
 (u'oa', 55),
 (u'oi', 51),
 (u'ui', 49),
 (u'oe', 46),
 (u'au', 44),
 (u'ue', 39),
 (u'eo', 36),
 (u'ia', 34),
 (u'ua', 25),
 (u'aa', 7),
 (u'eu', 6),
 (u'ii', 3),
 (u'ae', 2),
 (u'uu', 1),
 (u'uo', 1)]

By far the most common vowel-vowel bigram in Chapter 8 is "ou", with 554 occurrences, followed by "ea" (366 occurrences) and "oo" (304 occurrences) in a distant second and third place.


In [151]:
vc_bigrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou][^aeiou]',word))
sorted(vc_bigrams.items(), key=lambda tup: tup[1], reverse=True)[:20]


Out[151]:
[(u'in', 1157),
 (u'er', 850),
 (u'an', 623),
 (u'on', 540),
 (u'at', 516),
 (u'ed', 478),
 (u'es', 476),
 (u'ar', 455),
 (u'is', 439),
 (u'it', 429),
 (u'or', 414),
 (u'en', 392),
 (u'of', 346),
 (u'as', 326),
 (u'al', 322),
 (u'el', 285),
 (u'om', 277),
 (u'ow', 253),
 (u'et', 237),
 (u'ut', 221)]

In [152]:
cc_bigrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[^aeiou]{2}',word))
cc_bigrams.most_common()[:20]


Out[152]:
[(u'th', 1402),
 (u'ng', 631),
 (u'st', 452),
 (u'nd', 438),
 (u'll', 420),
 (u'nt', 210),
 (u'sh', 208),
 (u'rs', 172),
 (u'gh', 161),
 (u'ch', 161),
 (u'ck', 159),
 (u'ld', 151),
 (u'bl', 144),
 (u'wh', 141),
 (u'ss', 132),
 (u'rd', 113),
 (u'rt', 112),
 (u'ns', 111),
 (u'tt', 108),
 (u'rn', 98)]

More guttural, slicing, cutting sounds populate the bigrams appearing in this chapter: "in", "er", "an", "on", "at" dominate the vowel-consonant bigrams, while "th", "ng", "st", "nd", and "ll" dominate the consonant-consonant bigrams.

We can also look at a tabulated bigram table - this requires a different type of frequency distribution. We can create a conditional frequency distribution, which tabulates information like "what is the frequency of the second letter, conditional on the value of the first letter?" Create a conditional frequency distribution and use the tabulate() method:
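
To make the conditioning concrete, here's a tiny toy example (made-up pairs, not from the text): each pair passed to the ConditionalFreqDist is (condition, sample), and tabulate() lays out the conditions as rows and the samples as columns.

# Toy example: each pair is (condition, sample);
# rows of the table are conditions, columns are samples.
toy = nltk.ConditionalFreqDist([('t','h'), ('t','h'), ('t','o'), ('s','h')])
toy.tabulate()
# Expect something like:
#    h  o
# s  1  0
# t  2  1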


In [153]:
cvs = [cv for w in lowerlist for cv in re.findall(r'[^aeiou][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()


     a    e    i    o    u 
-    1    0    1    6    0 
.    1    0    0    0    0 
b  137  207   53  120   76 
c  144  156   38  229   43 
d  116  222  119  145   41 
f   68  133  106  214   40 
g   93  177   61  138   43 
h  521 1476  486  251   53 
j   17    9    4   26   32 
k    7  218  116    1    6 
l  196  340  306  290   60 
m  189  358  108  153   68 
n   76  326  130  233   32 
p  124  156   97  128   61 
q    0    0    0    0   55 
r  180  608  221  269   65 
s  181  419  157  159   76 
t  178  385  246  451   71 
v   21  306   39   18    0 
w  293  147  214  149    0 
x    7    7    1    1    1 
y    7  112   20  153    3 
z    2   22    8    1    0 
—   12    0   19   10    2 
‘    0    1    0    0    0 
’    2    2    1    0    0 

In [154]:
cvs = [cv for w in lowerlist for cv in re.findall(r'[aeiou][b-df-hj-npr-tv-z]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()


     b    c    d    f    g    h    j    k    l    m    n    p    r    s    t    v    w    x    y    z 
a   66  133  160   50   81   12    1   75  322  129  623  106  455  326  516  120   57    6  157   11 
e   11   61  478   52   31   21    2    8  285  128  392   53  850  476  237   66   70   27  180    4 
i   24  156  191   90  152    0    2   61  195  140 1157   47  156  439  429   64    1   16    1    8 
o   42   51  124  346   18   10    0   72  157  277  540   80  414  158  194   60  253    3   30    4 
u   43   64   25   27   74    0    1    5  123   73  202  110  202  188  221    0    0    0    6    3 

n-grams

We can also look at some n-grams - combinations of n letters. Let's start by revisiting our search for two vowels, modifying the regular expression to look for three or more vowels in a row:


In [155]:
vowel_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou]{3,}',word))
vowel_ngrams.most_common()[:20]


Out[155]:
[(u'uie', 7),
 (u'iou', 7),
 (u'eei', 6),
 (u'uee', 5),
 (u'eau', 4),
 (u'iiiiii', 1),
 (u'ieu', 1),
 (u'uea', 1),
 (u'aaaaaa', 1),
 (u'oooi', 1),
 (u'aaaaaaa', 1)]

We can also look for consonant-vowel-vowel or vowel-vowel-consonant combinations:


In [156]:
cvv_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[^aeiou][aeiou]{2}',word))
cvv_ngrams.most_common()[:20]


Out[156]:
[(u'you', 140),
 (u'loo', 98),
 (u'ree', 92),
 (u'sai', 79),
 (u'rea', 71),
 (u'hea', 57),
 (u'too', 54),
 (u'see', 53),
 (u'cou', 53),
 (u'rou', 48),
 (u'hou', 43),
 (u'tio', 41),
 (u'hei', 37),
 (u'goo', 36),
 (u'bou', 35),
 (u'rai', 30),
 (u'lea', 29),
 (u'yea', 27),
 (u'fee', 26),
 (u'qui', 25)]

In [157]:
vvc_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou]{2}[^aeiou]',word))
vvc_ngrams.most_common()[:20]


Out[157]:
[(u'out', 137),
 (u'ear', 96),
 (u'our', 87),
 (u'aid', 83),
 (u'eat', 70),
 (u'ood', 69),
 (u'ain', 68),
 (u'oun', 62),
 (u'ion', 62),
 (u'eet', 57),
 (u'oul', 55),
 (u'ook', 55),
 (u'ead', 53),
 (u'oom', 52),
 (u'ous', 42),
 (u'oug', 41),
 (u'een', 36),
 (u'eir', 36),
 (u'ies', 29),
 (u'eas', 29)]

In [158]:
vcv_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou][^aeiou][aeiou]',word))
vcv_ngrams.most_common()[:20]


Out[158]:
[(u'ere', 119),
 (u'ose', 94),
 (u'ome', 94),
 (u'one', 88),
 (u'ave', 81),
 (u'are', 64),
 (u'ine', 61),
 (u'ike', 59),
 (u'eve', 54),
 (u'ate', 53),
 (u'ive', 53),
 (u'use', 50),
 (u'ake', 46),
 (u'ove', 46),
 (u'eye', 44),
 (u'ame', 42),
 (u'ime', 38),
 (u'ide', 37),
 (u'ite', 35),
 (u'ice', 35)]

Some interesting results - "you", "out", and "ere" are the most common combinations of two vowels and a consonant, followed by "loo", "ear", "ose", and "ome". Plenty of o's in those syllables.

Creating an Index of n-grams

It's interesting to pick up on some of the patterns in bigrams and n-grams that occur in Lestrygonians, but how can we connect this better to the text itself? This is still a bit too abstract.

NLTK provides an Index object, which allows you to create a custom index from a list of (key, value) pairs. Much like a book's index collects a variety of subjects under headings and sub-headings, an Index object allows you to group information, but more powerfully - based on whatever criteria you choose. It behaves much like a Python dictionary that maps each key to a list of values.

In this case, we'll cluster words based on the n-grams that they contain, so that we can look up all the words that contain a particular n-gram.
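
To see how the Index behaves on its own, here's a tiny toy example (made-up pairs, not from the text): it's built from (key, value) pairs, and indexing with a key returns every value filed under it.

# Toy example of nltk.Index: built from (key, value) pairs,
# indexing returns the list of values filed under that key.
toy_index = nltk.Index([('fruit','apple'), ('veg','kale'), ('fruit','pear')])
print toy_index['fruit']    # ['apple', 'pear']
print toy_index['veg']      # ['kale']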


In [159]:
cc_ngram = [(cc,w) for w in lowerlist for cc in re.findall('[^aeiou]{2}',w)]
cc_index = nltk.Index(cc_ngram)

cv_ngram = [(cv,w) for w in lowerlist for cv in re.findall('[^aeiou][aeiou]',w)]
cv_index = nltk.Index(cv_ngram)

print "The following words were found using the bigram index."
print "The words are listed in order of appearance in the text."

p()
print "Bigram: pp"
print cc_index['pp']
p()
print "Bigram: bb"
print cc_index['bb']
p()
print "Bigram: pu"
print cv_index['pu']
p()
print "Bigram: ki"
print cv_index['ki']


The following words were found using the bigram index.
The words are listed in order of appearance in the text.
--------------------
Bigram: pp
[u'pineapple', u'stopper', u'pepper\u2019s', u'kippur', u'flapping', u'flapping', u'apples', u'apples', u'applewoman', u'flapping', u'appetite', u'suppose', u'dripping', u'lapping', u'happy', u'happier', u'stopped', u'supperroom', u'appearance', u'supper', u'happy', u'happy', u'chipped', u'snapped', u'gripped', u'approval', u'sloppy', u'supposed', u'suppose', u'happy', u'sourapple', u'opposite', u'apply', u'approval', u'pineapple', u'happier', u'happy', u'gripped', u'sloppy', u'sopping', u'sippets', u'suppose', u'souppot', u'kippur', u'sipped', u'smellsipped', u'kipper', u'pepper', u'ripped', u'dropping', u'nipples', u'suppose', u'sipping', u'hiccupped', u'lapped', u'upper', u'supper', u'tapping', u'opposite', u'suppose', u'tapped', u'tapping', u'suppose', u'quopped']
--------------------
Bigram: bb
[u'lobbing', u'bobbed', u'bobbob', u'rabbitpie', u'tubbing', u'shabby', u'stubbs', u'abbey', u'cobblestones', u'rubble', u'weggebobbles', u'ribbons', u'cabbage', u'cabbage', u'cabbage', u'bubble', u'wobbly', u'rabbi', u'cabbage', u'pebbles', u'dribbling', u'cobblestones', u'slobbers']
--------------------
Bigram: pu
[u'put', u'kippur', u'pudding', u'puffball', u'puke', u'purchase', u'put', u'jampuffs', u'pungent', u'purefoy', u'pull', u'popular', u'put', u'pugnosed', u'purefoy', u'put', u'public', u'pumpkin', u'pudding', u'punch', u'put', u'pupil', u'purefoy', u'squarepushing', u'putting', u'republicanism', u'purefoy', u'put', u'puffed', u'public', u'put', u'octopus', u'octopus', u'homespun', u'pursue', u'purty', u'pursued', u'pushed', u'pungent', u'pub', u'puzzle', u'kippur', u'puts', u'pure', u'putting', u'pub', u'put', u'pungent', u'put', u'putting', u'pulse', u'purple', u'pulp', u'put', u'pursed', u'put', u'publichouse', u'pulling', u'pulled', u'purse', u'put']
--------------------
Bigram: ki
[u'king', u'sucking', u'kidney', u'kitchen', u'thinking', u'kippur', u'drinking', u'looking', u'looking', u'king', u'kind', u'kino\u2019s', u'kinds', u'thinking', u'walking', u'skin', u'skilly', u'kiddy', u'looking', u'priestylooking', u'linking', u'skirts', u'raking', u'taking', u'talking', u'walking', u'taking', u'taking', u'thinking', u'kids', u'skimpy', u'drinking', u'thinking', u'kill', u'pumpkin', u'knocking', u'making', u'looking', u'talking', u'buckingham', u'kicked', u'walking', u'dickinson', u'taking', u'stockings', u'kind', u'making', u'thinking', u'looking', u'king\u2019s', u'walking', u'asking', u'walking', u'kinsella', u'skirts', u'drinking', u'baking', u'stockings', u'sticking', u'stockings', u'kissed', u'creaking', u'\u2014kiss', u'napkin', u'napkin', u'working', u'king', u'picking', u'kill', u'kill', u'kitchen', u'drinkingcup', u'smokinghot', u'looking', u'kippur', u'kipper', u'sucking', u'napkin', u'taking', u'kissing', u'mawkish', u'kindled', u'kitchen', u'killiney', u'making', u'moooikill', u'kish', u'kissed', u'mawkish', u'walking', u'kissed', u'kissed', u'kissed', u'kissed', u'kissed', u'speaking', u'drinking', u'stoking', u'taking', u'drinking', u'suckingbottle', u'kilkenny', u'slaking', u'kind', u'kind', u'skin', u'skin', u'walking', u'falkiner', u'cracking', u'sticking', u'kildare', u'making', u'looking', u'looking', u'looking']

If we revisit the most common trigrams, we can see which words caused those trigrams to appear so often. Take the "you" trigram, which was the most common at 140 occurrences:


In [160]:
cvc_ngram = [(cv,w) for w in lowerlist for cv in re.findall(r'[^aeiou][aeiou][aeiou]',w)]
cvc_index = nltk.Index(cvc_ngram)

print "Number of words containing trigram 'you': " + str( len(cvc_index['you']) )
print "Number of unique words containing trigram 'you': " + str( len(set(cvc_index['you'])) )
print "They are:"
for t in set(cvc_index['you']):
    print t


Number of words containing trigram 'you': 140
Number of unique words containing trigram 'you': 12
They are:
him—you
—you’re
you’ve
you’ll
youths
young
you’re
youth
you’d
you
yours
your

It's also more convenient to create an Index that looks at all trigrams, instead of just particular vowel/consonant patterns like the consonant-vowel-vowel trigrams above. Modifying the regular expression:


In [161]:
aaa_ngram = [(cv,w) for w in lowerlist for cv in re.findall(r'[A-Za-z]{3}',w)]
aaa_index = nltk.Index(aaa_ngram)

def print_trigram(trigram):
    print "Number of words containing trigram '"+trigram+"': " + str( len(aaa_index[trigram]) )
    print "Number of unique words containing trigram '" + trigram + "': " + str( len(set(aaa_index[trigram])) )
    print "They are:"
    trigram_fd = nltk.FreqDist(aaa_index[trigram])    
    for mc in trigram_fd.most_common(): 
        print "%32s : %4d"%(mc[0],mc[1])

p()
print_trigram('one')

p()
print_trigram('can')


--------------------
Number of words containing trigram 'one': 53
Number of unique words containing trigram 'one': 9
They are:
                             one :   35
                          no-one :    5
                       curbstone :    5
                            —one :    2
                       twentyone :    2
                          throne :    1
                       woebegone :    1
                       funnybone :    1
                            ones :    1
--------------------
Number of words containing trigram 'can': 30
Number of unique words containing trigram 'can': 8
They are:
                             can :   11
                           can’t :    9
                            cane :    5
                          canvas :    1
                       cannibals :    1
                          cannon :    1
                           canny :    1
                      canvassing :    1

Words, words, words

NLTK also provides tools for exploring the text at the word level. Let's start by looking at word lengths:


In [162]:
import numpy as np
sizes = [len(word) for word in lowerlist]
mean_size = np.mean(sizes)
var_size = np.std(sizes)
print "Mean word size: " + str(mean_size)
print "Std. dev. in word size: " + str(var_size)


Mean word size: 3.8366000132
Std. dev. in word size: 2.37301284681

To compose Lestrygonians, Joyce tended toward words containing harsher consonants like b's, g's, j's, and k's, and toward shorter word lengths of roughly 2-6 letters - in keeping with the peristaltic nature of the chapter's language, masticating words and passing them down the mental esophagus one bit at a time.

What about the percentage of vowels and consonants occurring in the text?

This is possible with a couple of one-line list comprehensions: iterate over all lowercase word tokens and, within each token, find every letter matching [aeiou] (the vowels) and every letter matching a consonant. Note that the complement class [^aeiou] would also match punctuation and apostrophes, so we use the explicit consonant range [b-df-hj-np-tv-z] instead. Summing the number of matches in each token gives the letter counts.


In [163]:
# Count consonants and vowels
consonants = [re.findall('[b-df-hj-np-tv-z]',w) for w in lowerlist]
n_consonants = sum([len(e) for e in consonants])
vowels = [re.findall('[aeiou]',w) for w in lowerlist]
n_vowels = sum([len(e) for e in vowels])

print "Lestrygonians contains %d consonants and %d vowels, ratio of vowels:consonants is %0.2f"%(n_consonants,n_vowels,n_vowels/n_consonants)


Lestrygonians contains 34816 consonants and 20150 vowels, ratio of vowels:consonants is 0.58

We also see that interesting phenomenon whereby the text remains partly or mostly comprehensible even with all of the vowels removed:


In [164]:
''.join([''.join(t)+" " for t in consonants[:145]])
#for t in consonants[:21]:
#    print ''.join(t)


Out[164]:
u'pnppl rck  lmn pltt  bttr sctch   sgrstcky grl shvllng scpfls f crms fr  chrstn brthr  sm schl trt  bd fr thr tmms  lzng nd cmft mnfctrr t hs mjsty th kng  gd  sv  r  sttng n hs thrn sckng rd jjbs wht   smbr ymc  yng mn  wtchfl mng th wrm swt fms f grhm lmns  plcd  thrwwy n  hnd f mr blm  hrt t hrt tlks  bl  m  n  bld f th lmb  hs slw ft wlkd hm rvrwrd  rdng  r y svd  ll r wshd n th bld f th lmb  gd wnts bld vctm  brth  hymn  mrtyr  wr  fndtn f  bldng  scrfc  kdny brntffrng  drds '

Let's revisit that text object from earlier. Recall we created it using the code:

text = nltk.Text(tokens)

The Text object has made a determination of what words are similar, based on their appearance near one another in the text. We can obtain this list using the similar() method. The results are interesting to consider:


In [165]:
text.similar('woman')


run suckingbottle bull feast cheque night collation dream

In [166]:
text.similar('man')


nannygoat provinces eating penny bed walk person stick chat vacuum
marketnet bit day

Notice the woman-night and man-day association, and the woman-bull and man-nannygoat association. Some curious things here...

Because of the way the Text class is implemented, these are printed directly to output, instead of being returned as a list or something convenient. See the NLTK source for details. But we can fix this by extending the Text class to make our own class, called the SaneText class, to behave more sanely:


In [167]:
##################################################
# This block of code is extending an NLTK object
# to modify its behavior.
# 
# This creates a SaneText object, which modifies the similar() method
# to behave in a sane way (return a list of words, instead of printing it
# to the console).
#
# Source: NLTK source code. 
# Modified lines are indicated.
# http://www.nltk.org/_modules/nltk/text.html#ContextIndex.word_similarity_dict

from nltk.compat import Counter
from nltk.util import tokenwrap

class SaneText(nltk.Text):

    def similar(self, word, num=20):
        """
        This is copied and pasted from the NLTK source code,
        but with print statements replaced with return statements.
        """

        if '_word_context_index' not in self.__dict__:
            #print('Building word-context index...')
            self._word_context_index = nltk.ContextIndex(self.tokens,
                                                    filter=lambda x:x.isalpha(),
                                                    key=lambda s:s.lower())

#        words = self._word_context_index.similar_words(word, num)

        word = word.lower()
        wci = self._word_context_index._word_to_contexts
        if word in wci.conditions():
            contexts = set(wci[word])
            fd = Counter(w for w in wci.conditions() for c in wci[w]
                          if c in contexts and not w == word)
            words = [w for w, _ in fd.most_common(num)]
            
            #######
            # begin changed lines
            return tokenwrap(words)
        else:
            return u''
            # end changed lines
            ######
#
# Done defining custom class
################################################

In [168]:
sanetext = SaneText(tokens)

def similar_words(w):
    print "%32s : %s"%( w, sanetext.similar(w) )

similar_words('you')
similar_words('bread')
similar_words('beer')
similar_words('food')
similar_words('blood')
similar_words('meat')
similar_words('grave')
similar_words('right')
similar_words('priest')
similar_words('devil')
similar_words('coat')
similar_words('dress')
similar_words('life')
similar_words('babies')
similar_words('lamb')
similar_words('king')
similar_words('cut')
similar_words('fork')


                             you : him me all i cross soap how found bread fizz
                           bread : things you fizz tooth
                            beer : his
                            food : veins on times
                           blood : stream presence house broth church ends dead provost stings skin
scrapings stare closes conversion gnaw top tip somethings wings busk
                            meat : chummies lemon gold porringers such pleasure
                           grave : lees fire menu stooled mater bench cobblestones window curbstone
plates world river minute meet
                           right : allusion clock hasty ballastoffice completely belly night
                          priest : thing wit taxes
                           devil : lamb stopper carver rum fascination spring confession gaff missus
curves brain rightabout lines northwest way sky world left arm ground
                            coat : babies
                           dress : notice house remark hand pocket glass guard
                            life : cutlet was
                          babies : coat
                            lamb : king devil northwest curves
                            king : lamb
                             cut : has ate said on
                            fork : walked tommycans

In [169]:
similar_words('he')
similar_words('she')
similar_words('what')
similar_words('God')


                              he : she not i too morning
                             she : he even what i morning they multiply
                            what : would it she multiply messiah if
                             God : see publichouse

Some curious and unexpected connections here - particularly "God" and "publichouse."

Part of Speech

Let's tag each sentence of the text with its part of speech (POS), using NLTK's built-in method (trained on a large corpus of data).

NOTE: It's important to be very skeptical of the part of speech tagger's results. The following does not focus on whether the tagged parts of speech are correct, since assessing that properly leads into other topics involving multi-word context (which we touch on below). Here, we show how to analyze the results of a part of speech tagger, not how to train one to be more accurate.

We can tag the parts of speech with the nltk.pos_tag() method. This is a static method that results in a list of tuples. Each tuple represents a word and its part of speech tag. The sentence

I am Jane.

would be tagged:

I (PRON) am (VERB) Jane (NOUN) . (.)

and would be stored as a list of tuples:

[('I',    'PRON'),
 ('am',   'VERB'),
 ('Jane', 'NOUN'),
 ('.',    '.')]

In [170]:
print nltk.pos_tag(['I','am','Jane','.'],tagset='universal')


[('I', u'PRON'), ('am', u'VERB'), ('Jane', u'NOUN'), ('.', u'.')]

We can either pass it a list (like the result of the nltk.word_tokenize() method, i.e. our tokens variable above):


In [171]:
p()
print type(tokens[:21])
p()
print tokens[:21]
p()
print nltk.pos_tag(tokens[:21],tagset='universal')


--------------------
<type 'list'>
--------------------
[u'Pineapple', u'rock', u',', u'lemon', u'platt', u',', u'butter', u'scotch', u'.', u'A', u'sugarsticky', u'girl', u'shovelling', u'scoopfuls', u'of', u'creams', u'for', u'a', u'christian', u'brother', u'.']
--------------------
[(u'Pineapple', u'NOUN'), (u'rock', u'NOUN'), (u',', u'.'), (u'lemon', u'ADJ'), (u'platt', u'NOUN'), (u',', u'.'), (u'butter', u'NOUN'), (u'scotch', u'NOUN'), (u'.', u'.'), (u'A', u'DET'), (u'sugarsticky', u'ADJ'), (u'girl', u'NOUN'), (u'shovelling', u'VERB'), (u'scoopfuls', u'NOUN'), (u'of', u'ADP'), (u'creams', u'NOUN'), (u'for', u'ADP'), (u'a', u'DET'), (u'christian', u'ADJ'), (u'brother', u'NOUN'), (u'.', u'.')]

or we can pass it an NLTK Text object, like our text object above:


In [172]:
p()
t = nltk.Text(nltk.Text(tokens[:21]))
print type(t)
p()
print t
p()
print nltk.pos_tag(t,tagset='universal')


--------------------
<class 'nltk.text.Text'>
--------------------
<Text: Pineapple rock , lemon platt , butter scotch...>
--------------------
[(u'Pineapple', u'NOUN'), (u'rock', u'NOUN'), (u',', u'.'), (u'lemon', u'ADJ'), (u'platt', u'NOUN'), (u',', u'.'), (u'butter', u'NOUN'), (u'scotch', u'NOUN'), (u'.', u'.'), (u'A', u'DET'), (u'sugarsticky', u'ADJ'), (u'girl', u'NOUN'), (u'shovelling', u'VERB'), (u'scoopfuls', u'NOUN'), (u'of', u'ADP'), (u'creams', u'NOUN'), (u'for', u'ADP'), (u'a', u'DET'), (u'christian', u'ADJ'), (u'brother', u'NOUN'), (u'.', u'.')]

If we tag the part of speech of the entire text, we can utilize list comprehensions, built-in string methods, and regular expressions to pick out some interesting information about parts of speech in Chapter 8. For example, we can search for verbs and include patterns to ensure a particular tense, search for nouns ending in "s" only, or extract parts of speech and analyze frequency distributions.

We'll start by tagging the parts of speech of our entire text. Recall above our list of all tokens was stored in the variable tokens:

tokens = nltk.word_tokenize(io.open('txt/08lestrygonians.txt','r').read())

We can feed this to the pos_tag() method to get the fully-tagged text of Lestrygonians.


In [173]:
words_tags = nltk.pos_tag(tokens,tagset='universal')
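
Before building frequency distributions, here is a minimal sketch of the kind of filtering described above (the names past_verbs and plural_nouns are just for illustration): combining the universal POS tags in words_tags with regular expressions to pull out past-tense-looking verbs and plural-looking nouns.

# A minimal sketch: combine the POS tags with regular expressions
# to filter words (assumes words_tags from the cell above).
past_verbs = sorted(set(w.lower() for (w, t) in words_tags
                        if t == 'VERB' and re.search('ed$', w)))
plural_nouns = sorted(set(w.lower() for (w, t) in words_tags
                          if t == 'NOUN' and re.search('s$', w.lower())))
print past_verbs[:10]
print plural_nouns[:10]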

Now use a list comprehension to extract tags from the word/tag combination (the variable words_tags) and pass the tags to a frequency distribution object.


In [174]:
tag_fd = nltk.FreqDist(tag for (word, tag) in words_tags)
tag_fd.most_common()[:15]

summ = 0
for mc in tag_fd.most_common():
    summ += mc[1]
for mc in tag_fd.most_common():
    print "%s : %0.2f"%( mc[0], (mc[1]/summ)*100 )


NOUN : 28.62
. : 17.08
VERB : 14.45
ADP : 9.48
PRON : 8.62
DET : 7.80
ADJ : 5.31
ADV : 3.95
PRT : 2.24
CONJ : 1.58
NUM : 0.64
X : 0.22

An interesting observation - between the nouns, prepositions (ADP), conjunctions, and punctuation (the '.' tag), you've got nearly 57% of the chapter covered.

There is an implicit bias, in simple taggers, to tag unknown words as nouns, so that may be the cause for all the nouns. But Lestrygonians is a more earthy, organic, and sensory chapter, so it would make sense that Joyce focuses more on nouns, on things and surroundings, and that those would fill up the chapter.

We'll see later on that we can explore those parts of speech tags to see whether they were correct and where they went wrong, when we look at multi-word phrases.

By utilizing the built-in FreqDist object, we can explore the most common occurrences of various parts of speech. For example, if we create a FreqDist with a tagged version of Lestrygonians, we can filter the most common words by their part of speech:


In [175]:
# Create a frequency distribution from POS tag data
words_tags_fd = nltk.FreqDist(words_tags)

In [176]:
most_common_verbs = [wt[0] for (wt, _) in words_tags_fd.most_common() if wt[1] == 'VERB']
most_common_verbs[:30]


Out[176]:
[u'was',
 u'said',
 u'is',
 u'be',
 u'have',
 u'see',
 u'are',
 u'do',
 u'had',
 u'know',
 u'did',
 u'say',
 u'get',
 u'has',
 u'would',
 u'coming',
 u'take',
 u'must',
 u'come',
 u'go',
 u'going',
 u'got',
 u'could',
 u'asked',
 u'put',
 u'used',
 u'will',
 u'think',
 u'walked',
 u'passed']

In [177]:
most_common_nouns = [wt[0] for (wt, _) in words_tags_fd.most_common() if wt[1] == 'NOUN']
most_common_nouns[:20]


Out[177]:
[u'Mr',
 u'Bloom',
 u'eyes',
 u'street',
 u'hand',
 u'Flynn',
 u'man',
 u'Nosey',
 u'time',
 u'way',
 u'Byrne',
 u'Davy',
 u'Must',
 u'day',
 u'Mrs',
 u'something',
 u'night',
 u'mouth',
 u'woman',
 u'things']

Using a conditional frequency distribution, we can look at which words are most likely for a given part of speech - that is, find the most frequent words for each part of speech. We do this by tabulating a conditional frequency distribution over two variables: the part of speech (the condition) and the word. The conditional frequency distribution can then be queried for all of the words corresponding to a particular part of speech, and their frequencies.


In [178]:
cfd2 = nltk.ConditionalFreqDist((tag, word.lower()) for (word, tag) in words_tags)

Now we can, for example, print the most common numerical words that appear in the chapter, and we can see that one is the most common, followed by two, three, and five.


In [179]:
print cfd2['NUM'].most_common()


[(u'one', 34), (u'two', 27), (u'three', 10), (u'five', 9), (u'thousand', 3), (u'six', 3), (u'seven', 1), (u'it\u2019s', 1), (u'ten', 1), (u'eight', 1), (u'four', 1), (u'middle', 1), (u'devour', 1), (u'110', 1), (u'twentyone', 1), (u'no-one', 1), (u'85', 1)]

Patterns in Parts of Speech

Things get more interesting when we expand the number of words we're looking at to include phrases. For example, the following loop will look for phrases in the form <POS1> <POS2> <POS3> and will print them out:


In [180]:
def pos_combo(pos1,pos2,pos3):
    for i in range(len(words_tags)-2):
        if( words_tags[i][1]==pos1 ):
            if( words_tags[i+1][1]==pos2 ):
                if( words_tags[i+2][1]==pos3 ):
                    print ' '.join([words_tags[i][0],words_tags[i+1][0],words_tags[i+2][0]])

pos_combo('ADJ','ADJ','NOUN')


hungry famished gull
tall white hats
only reliable inkeraser
ankle first day
poor old sot
little brother’s family
other old mosey
huge high door
last broad tunic
Few years’ time
literary etherial people
keep quiet relief
little naughty boy
warm human plumpness
last pagan king
sweetish warmish cigarettesmoke
vegetarian fine flavour
big tour end
fresh clean bread
eggs fifty years
filleted lemon sole
Soft warm sticky
dry pen signature
lovely seaside girls
fine fine straw
long windy steps

In [181]:
pos_combo('ADV','ADV','VERB')


then why is
as well get
not even registered
always feels complimented
Still better tell
never once saw
then you’d have

In [182]:
pos_combo('ADP','VERB','NOUN')


Like getting £
inside writing letters
in pudding time
Like holding water
of bloodhued poplin
in trickling hallways
with sopping sippets
of making money
with juggling fingers

"with sopping sippets" above shows an example of a mislabeled word - "sopping" should be labeled as an adjective, but was mislabeled by a tagger that was indiscriminately labeling "ing" words as verbs (getting, writing, holding). Likewise, "pudding" was accidentally tagged as a verb for the same reason. There's also an -ed word, "bloodhued", which was, for some inexplicable reason, tagged as a verb instead of an adjective.

In any case, we see that there is room for improvement, but it's an interesting way to explore the text nevertheless.

We already explored what parts of speech are the most common in this text. But suppose we're interested, now, in what combinations of parts of speech are the most common. To do this, we can construct a conditional frequency distribution based on the parts of speech.

If we're thinking about three-word phrases, then we want to tabulate how frequently particular combinations of three parts of speech show up. We can think of this as a probability distribution of three random variables taking on values from a set of nominal values ('NOUN', 'VERB', etc.). This is equivalent to a three-dimensional space chopped up into bins, with different frequencies occurring in each bin based on the words appearing in the text.

However, because NLTK's built-in conditional frequency object is only designed to handle two-dimensional data, we'll have to be a bit careful. We'll start by picking the first part of speech (thus eliminating one variable). Then we'll tabulate the conditional frequencies of the combinations of parts of speech for the remaining two words.


In [183]:
def get_trigrams(tagged_words):
    for i in range(len(tagged_words)-2):
        yield (tagged_words[i],tagged_words[i+1],tagged_words[i+2])

trigram_freq = {}
for ((word1,tag1),(word2,tag2),(word3,tag3)) in get_trigrams(words_tags):    
    if( tag1 in trigram_freq.keys() ):
        trigram_freq[tag1].append((tag2,tag3))
    else:
        trigram_freq[tag1] = [(tag2,tag3)]

adj_cf = nltk.ConditionalFreqDist(trigram_freq['ADJ'])
print adj_cf.tabulate()


        .  ADJ  ADP  ADV CONJ  DET NOUN  NUM PRON  PRT VERB    X 
   .    0    8    7    7    0   10   33    0    5    0   18    1 
 ADJ    4    2    1    1    0    1   26    1    0    0    1    0 
 ADP    2    0    2    1    0   15   11    0   16    0    2    0 
 ADV   16    2    1    0    0    0    1    0    1    1    3    0 
CONJ    0   12    0    0    0    0    1    0    0    0    0    0 
 DET    0    1    0    0    0    0    4    0    0    0    0    0 
NOUN  233   10   95   15   16   10   62    2   26   13   56    1 
 NUM    2    0    0    0    0    0    4    0    0    0    1    0 
PRON    0    0    0    0    0    0    0    0    0    0   10    0 
 PRT    1    0    0    0    0    0    0    0    0    0    3    0 
VERB    3    1    4    2    0    4    4    1    1    2    3    0 
   X    0    0    0    0    0    0    0    0    0    1    0    0 
None

This is a dense, interesting table: it shows the likelihood of particular parts of speech occurring after adjectives.

To utilize this table, we begin by selecting a part of speech (in this case, adjective) that is the basis for the table. Next, we pick the part of speech of the second word, and select it from the labels on the left side of the table. (That is, the rows indicate the part of speech of the second word in our trigram.) Finally, we pick the part of speech of the third word, and select it from the labels on the top of the table. (The columns indicate the part of speech of the third word in our trigram.)

This gives the total occurrences of this combination of parts of speech.

The first thing we notice is that nouns are by far the most common part of speech to occur after adjectives - precisely what we would expect. Verbs are far less common as the second word in our trigram - the adjective-verb combination is rather unusual. But verbs are much more common as the third part of speech in the trigram.

Suppose we're interested in the combination adjective-pronoun-verb, which the table tells us occurs exactly 10 times. We can print out these combinations:


In [184]:
pos_combo('ADJ','PRON','VERB')


priest they are
inkbottle I suggested
sure she was
nun they say
flat they look
Lucky I had
Italian I prefer
ready he drained
green it would
First I must

The adjective-pronoun-verb combination seems to happen when there is a change or transition in what the sentence is saying - the beginning of a new phrase.

On the other hand, if we look at the adjective-noun-adjective combination, which is similarly infrequent, it shows words that look like they were intended to cluster together:


In [185]:
pos_combo('ADJ','NOUN','ADJ')


fifty yards astern
old mosey lunatic
big establishments whole
right royal old
casual wards full
fifty years old
full lips full
woman’s breasts full
Lean people long
biliary duct spleen

In [186]:
print sanetext
print dir(sanetext)


<Text: Pineapple rock , lemon platt , butter scotch...>
['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_context', '_word_context_index', 'collocations', 'common_contexts', 'concordance', 'count', 'dispersion_plot', 'findall', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'unicode_repr', 'vocab']

To deal more deftly with this information, and actually make good use of it, we would need to get a little more sophisticated with how we're storing our independent variables.
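
One simple way to do that (a sketch, not used in the rest of this notebook; trigram_tag_fd is just an illustrative name) is to key a single frequency distribution on the whole tag triple, which stores the full three-dimensional table directly:

# A sketch: key a FreqDist on the (tag1, tag2, tag3) triple itself,
# using get_trigrams() and words_tags from above.
trigram_tag_fd = nltk.FreqDist((t1, t2, t3)
                               for ((w1, t1), (w2, t2), (w3, t3)) in get_trigrams(words_tags))
print trigram_tag_fd.most_common(10)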

This information could be used to reveal what combinations of parts of speech are most likely to be complete, standalone phrases. For example, the adjective-pronoun-verb combination does not result in words that are intended to work together in a single phrase. If we were to analyze three-word phrases beginning with pronouns, for example, we would see that pronouns followed by verbs are very common:


In [187]:
pro_cf = nltk.ConditionalFreqDist(trigram_freq['PRON'])
print pro_cf.tabulate()


        .  ADJ  ADP  ADV CONJ  DET NOUN  NUM PRON  PRT VERB    X 
   .    3    9   14    3    3    8   69    2   21    1   37    1 
 ADJ    4    5    0    1    2    0   55    0    0    0    0    0 
 ADP    5    4    9    1    1   28   22    0   17    0    5    0 
 ADV   15    1    4    2    1    1    0    0    1    2   13    0 
CONJ    0    0    0    0    0    0    1    0    1    0    1    0 
 DET    5    8    3    1    0    1   11    0    0    0    5    0 
NOUN  138    1   34   16   15    6   23    1    7   11   61    0 
 NUM    0    3    0    0    0    0    1    0    0    0    0    0 
PRON    1    1    1    1    0    0    1    0    1    0   15    0 
 PRT    9    2   10    1    2    4    4    0    5    0    4    0 
VERB  109   19   56   47    1   56   29    1   96   34   72    0 
None

The fact that the pronoun-verb combination occurs frequently, but the adjective-pronoun-verb combination does not, indicates that the adjective-pronoun-verb combination is unlikely to be a sensible phrase. By analyzing a few specific sentences and identifying issues with tags, as we did above, it's possible to improve the part of speech tagger. It's also possible to use a hierarchical part of speech tagger - one that starts by looking at three or four neighboring words and determining the part of speech based on their parts of speech. If that fails, it looks at two neighboring words, or one, down to the simplest case, when it looks at the single word itself. Many of the part of speech tags that failed were fooled by word suffixes and ignored their context.

Improving Part of Speech Tagging

To improve the tagging of parts of speech, we can use n-gram tagging, which is the idea that when you're tagging a word with a part of speech, it can be helpful to look at neighboring words and their parts of speech. However, in order to do this, we'll need a set of training data, to train our part of speech tagger.

Start by importing a corpus. The Brown corpus, described at the Brown Corpus Wikipedia page, is approximately one million tagged English words, in a range of different categories. This is far more complete than the treebank corpus, another tagged (but only partially complete) corpus from the University of Pennsylvania (the Treebank project's homepage gives a 404, but see Treebank-3 from the Linguistic Data Consortium).

We'll use two data sets: one training data set, to train the part of speech tagger, using tagged sentences; and one test data set, to which we apply our part of speech tagger, and compare to a known result (tagged versions of the same sentences), enabling a quantitative measure of accuracy.


In [188]:
from nltk.corpus import brown
print dir(brown)


['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_add', '_c2f', '_delimiter', '_encoding', '_f2c', '_file', '_fileids', '_get_root', '_init', '_map', '_para_block_reader', '_pattern', '_resolve', '_root', '_sent_tokenizer', '_sep', '_tagset', '_unload', '_word_tokenizer', 'abspath', 'abspaths', 'categories', 'citation', 'encoding', 'ensure_loaded', 'fileids', 'license', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'tagged_paras', 'tagged_sents', 'tagged_words', 'unicode_repr', 'words']

In [189]:
print brown.categories()


[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']

In [190]:
# Get tagged and untagged sentences of fiction
btagsent = brown.tagged_sents(categories='fiction')
bsent = brown.sents(categories='fiction')

Now split the data into two parts: the training data set and the test data set. The NLTK book suggests 90%/10%:


In [191]:
z = 0.90          # z is the fraction of data used for training
omz = 1 - z       # one minus z (the fraction held out for testing)
traincutoff = int(len(btagsent)*z)

traindata = btagsent[:traincutoff]
testdata = btagsent[traincutoff:]

In [192]:
# Import timeit so we can time how long it takes to train and test POS tagger
import timeit

In [193]:
start_time = timeit.default_timer()

# NLTK makes training and testing really easy
unigram_tagger = nltk.UnigramTagger(traindata)
perf1 = unigram_tagger.evaluate(testdata)

time1 = timeit.default_timer() - start_time

print perf1
print time1


0.824455652602
0.989324092865

Different taggers can be chained together with backoff: each tagger hands any word it can't tag to the next tagger in the chain. For example:


In [194]:
start_time = timeit.default_timer()

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(traindata, backoff=t0)
t2 = nltk.BigramTagger(traindata, backoff=t1)
perf2 = t2.evaluate(testdata)

time2 = timeit.default_timer() - start_time

print perf2
print time2


0.86615819904
2.56884694099

In [195]:
print "Fitting improvement: %d %%"%(100*abs(perf2-perf1)/perf1)
print "Timing penalty: %d %%"%(100*abs(time2-time1)/time1)


Fitting improvement: 5 %
Timing penalty: 159 %

Hmmmm......

In any case - if we want to save this model, we can save it in a pickle file:


In [196]:
from pickle import dump
output = open('lestrygonians_parser.pkl', 'wb')
dump(t2, output, -1)
output.close()

## To load:
# from pickle import load
# input = open('lestrygonians_parser.pkl', 'rb')
# tagger = load(input)
# input.close()

In [197]:
better_tags = t2.tag(tokens)

print "%40s\t\t%30s"%("original","better")
for z in range(110):
    print "%35s (%s)\t%35s (%s)"%(words_tags[z][0],words_tags[z][1],better_tags[z][0],better_tags[z][1])


                                original		                        better
                          Pineapple (NOUN)	                          Pineapple (NN)
                               rock (NOUN)	                               rock (NN)
                                  , (.)	                                  , (,)
                              lemon (ADJ)	                              lemon (NN)
                              platt (NOUN)	                              platt (NN)
                                  , (.)	                                  , (,)
                             butter (NOUN)	                             butter (NN)
                             scotch (NOUN)	                             scotch (NN)
                                  . (.)	                                  . (.)
                                  A (DET)	                                  A (AT)
                        sugarsticky (ADJ)	                        sugarsticky (NN)
                               girl (NOUN)	                               girl (NN)
                         shovelling (VERB)	                         shovelling (NN)
                          scoopfuls (NOUN)	                          scoopfuls (NN)
                                 of (ADP)	                                 of (IN)
                             creams (NOUN)	                             creams (NN)
                                for (ADP)	                                for (IN)
                                  a (DET)	                                  a (AT)
                          christian (ADJ)	                          christian (NN)
                            brother (NOUN)	                            brother (NN)
                                  . (.)	                                  . (.)
                               Some (DET)	                               Some (DTI)
                             school (NOUN)	                             school (NN)
                              treat (NOUN)	                              treat (VB)
                                  . (.)	                                  . (.)
                                Bad (NOUN)	                                Bad (NN)
                                for (ADP)	                                for (IN)
                              their (PRON)	                              their (PP$)
                            tummies (NOUN)	                            tummies (NN)
                                  . (.)	                                  . (.)
                            Lozenge (NOUN)	                            Lozenge (NN)
                                and (CONJ)	                                and (CC)
                             comfit (VERB)	                             comfit (NN)
                       manufacturer (NOUN)	                       manufacturer (NN)
                                 to (PRT)	                                 to (TO)
                                His (PRON)	                                His (PP$)
                            Majesty (NOUN)	                            Majesty (NN)
                                the (DET)	                                the (AT)
                               King (NOUN)	                               King (NN-TL)
                                  . (.)	                                  . (.)
                                God (NOUN)	                                God (NP)
                                  . (.)	                                  . (.)
                               Save (NOUN)	                               Save (NN)
                                  . (.)	                                  . (.)
                                Our (PRON)	                                Our (PP$-TL)
                                  . (.)	                                  . (.)
                            Sitting (VERB)	                            Sitting (VBG)
                                 on (ADP)	                                 on (IN)
                                his (PRON)	                                his (PP$)
                             throne (NOUN)	                             throne (NN)
                            sucking (NOUN)	                            sucking (VBG)
                                red (ADJ)	                                red (JJ)
                            jujubes (ADJ)	                            jujubes (NN)
                              white (ADJ)	                              white (JJ)
                                  . (.)	                                  . (.)
                                  A (DET)	                                  A (AT)
                             sombre (ADJ)	                             sombre (NN)
                            Y.M.C.A (NOUN)	                            Y.M.C.A (NN)
                                  . (.)	                                  . (.)
                              young (ADJ)	                              young (JJ)
                                man (NOUN)	                                man (NN)
                                  , (.)	                                  , (,)
                           watchful (ADJ)	                           watchful (NN)
                              among (ADP)	                              among (IN)
                                the (DET)	                                the (AT)
                               warm (ADJ)	                               warm (JJ)
                              sweet (NOUN)	                              sweet (JJ)
                              fumes (NOUN)	                              fumes (NNS)
                                 of (ADP)	                                 of (IN)
                             Graham (NOUN)	                             Graham (NN)
                            Lemon’s (NOUN)	                            Lemon’s (NN)
                                  , (.)	                                  , (,)
                             placed (VERB)	                             placed (VBD)
                                  a (DET)	                                  a (AT)
                          throwaway (NOUN)	                          throwaway (NN)
                                 in (ADP)	                                 in (IN)
                                  a (DET)	                                  a (AT)
                               hand (NOUN)	                               hand (NN)
                                 of (ADP)	                                 of (IN)
                                 Mr (NOUN)	                                 Mr (NN)
                              Bloom (NOUN)	                              Bloom (NN)
                                  . (.)	                                  . (.)
                              Heart (NOUN)	                              Heart (NN-TL)
                                 to (PRT)	                                 to (TO)
                              heart (NOUN)	                              heart (NN)
                              talks (NOUN)	                              talks (NN)
                                  . (.)	                                  . (.)
                               Bloo (NOUN)	                               Bloo (NN)
                                ... (.)	                                ... (NN)
                                 Me (NOUN)	                                 Me (NN)
                                  ? (.)	                                  ? (.)
                                 No (NOUN)	                                 No (RB)
                                  . (.)	                                  . (.)
                              Blood (NOUN)	                              Blood (NN)
                                 of (ADP)	                                 of (IN)
                                the (DET)	                                the (AT)
                               Lamb (NOUN)	                               Lamb (NN)
                                  . (.)	                                  . (.)
                                His (PRON)	                                His (PP$)
                               slow (ADJ)	                               slow (JJ)
                               feet (NOUN)	                               feet (NNS)
                             walked (VERB)	                             walked (VBD)
                                him (PRON)	                                him (PPO)
                          riverward (NOUN)	                          riverward (NN)
                                  , (.)	                                  , (,)
                            reading (NOUN)	                            reading (VBG)
                                  . (.)	                                  . (.)
                                Are (NOUN)	                                Are (BER)
                                you (PRON)	                                you (PPSS)
                              saved (VERB)	                              saved (VBN)
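
One caveat: the two columns don't use the same tag vocabulary. The original tags (NOUN, ADJ, DET, ...) are from the coarse universal tagset, while the Brown-trained tagger emits the much finer-grained Brown tags (NN, JJ, AT, NN-TL, ...). For an apples-to-apples comparison, one option (a sketch, assuming an NLTK version whose corpus readers accept the tagset='universal' argument) would be to retrain the same backoff chain on the universal-tagged Brown sentences:

# Retrain the backoff chain on universal-tagset Brown sentences
btagsent_u = brown.tagged_sents(categories='fiction', tagset='universal')
train_u = btagsent_u[:traincutoff]
test_u = btagsent_u[traincutoff:]

t0u = nltk.DefaultTagger('NOUN')
t1u = nltk.UnigramTagger(train_u, backoff=t0u)
t2u = nltk.BigramTagger(train_u, backoff=t1u)

print t2u.evaluate(test_u)
print t2u.tag(tokens[:10])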

From here, the analysis moves on to phrases and sentences, which will be covered in Part II.

